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O" 1 Summary. The expected level of linkage disequilibrium (LD) in a finite ideal population at equi- 

librium is of relevance for many applications in population and quantitative genetics. Several 
^vq recursion formulae have been proposed during the last decades, whose derivations mostly 

j> contain heuristic parts and therefore remain mathematically questionable. We propose a more 

VO justifiable approach, including an alternative recursion formula for the expected LD. Since the 

l/"~) exact formula depends on the distribution of allele frequencies in a very complicated manner, 

00 we suggest an approximate solution and analyze its validity extensively in a simulation study. 

Compared to the widely used formula of Sved, the proposed formula performs better for all pa- 
t^j- rameter constellations considered. We then analyze the expected LD at equilibrium using the 

(^ theory on discrete-time Markov chains based on the linear recursion formula, with equilibrium 

C<~) being defined as the steady-state of the chain, which finally leads to a formula for the effective 

t-h population size N e . An additional analysis considers the effect of non-exactness of a recur- 

sion formula on the steady-state, demonstrating that the resulting error in expected LD can be 
• r- 1 substantial. In an application to the HapMap data of two human populations we illustrate the 

^ dependency of the TVe-estimate on the distribution of minor allele frequencies (MAFs), show- 

ing that N e can vary by up to 30% when a uniform instead of a skewed distribution of MAFs is 
taken as a basis to select SNPs for the analyses. Our analyses provide new insights into the 
mathematical complexity of the problem studied. 

Keywords: linkage disequilibrium, effective population size, Markov chain, HapMap, allele fre- 
quency spectrum 

1. Introduction 

In genetics research, the decay of linkage disequilibrium (LD) as a function of the distance 
of the considered loci is an important characteristic of a population. One measure of LD 
between two loci which has widely been used in the literature is r 2 (cf. Hill and Weir (1994)), 
which depends on the frequencies of gametes in the considered population. 
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Moreover, it is commonly assumed that a finite population of size N with constant 
recombination rate c over time for a given pair of loci achieves a state of "equilibrium" after 
a certain time. Usually, this state of equilibrium is said to be reached when the expected 
amount of LD does not change from one generation to the next. 

The effective population size N e , which is defined as the size of an ideal population 
at equilibrium with the same structure of LD as the population under consideration (cf. 
Hcdrick (2011)), is an important population parameter when considering how real popula- 
tions evolved over time. In practice, N e cannot be measured but LD can. Hence, efforts 
have been made to link the two quantities by formulae of the form E(r 2 ) m f(c,N e ), with 
a function / depending on c and N e . 



1.1. Sved's formula for the expected linkage disequilibrium (Sved, 1971) 

The following formula for the expected LD in a population was proposed by Sved (1971) 
and has been used extensively to estimate N e : 

E(r 2 ) = - (Sved's formula) (1) 

y ' 1 + 4N„c 



This equality can be written as 

N R = 



Eg) - 1 = 1-E(r 2 ) 
4c 4cE(r 2 ) : 



and by using an empirically estimated E(r 2 ), the effective population size N e can be cal- 
culated. Hayes et al. (2003) argue that the estimated N e then corresponds to an effective 
population size ^- generations ago, if one assumes that the population grows linearly over 
time. In the following, we will refer to formula (1) as "Sved's formula". Considering two 
loci, Sved (1971) derived this formula based on the following recursion formula for the con- 
ditional probability Qt of identity by descent (IBD) at the second locus, given that two 
sampled gametes from the population are IBD at the first locus in generation T: 

Q T =(l- ^-) (1 - c) 2 Q T -i + ^(1 " c) 2 (Sved's recursion formula) (2) 

Note that this recursion formula is of linear form Qt = clQt-i + b with constants a and 
b. Sved claims that Qt — K(r T ), where r T is the LD after T generations. Additionally, 
equilibrium is considered to be the point in time for which Qt+i = Qt- Based on this 
definition, the equation Qt = E(r^) combined with eqn (2) yields approximately eqn (1) 
for small values of c and after replacing N with N e . 

Sved's formula has been used in different areas of research and applications, ranging from 
animal breeding (de Roos et al., 2008; Flury et al., 2010; Meuwissen et al., 2001; Qanbari 
et al., 2010) and plant breeding (Remington ct al., 2001) to human genetics (McEvoy 
et al., 2011; Tenesa ct al., 2007), and it has become one of the standard approaches for 
./Ve-estimation. 



1 .2. Mathematical shortcomings of previous derivations 

Several other derivations of the formula have been suggested in the last forty years (Sved, 

2008, 2009; Sved and Feldmann, 1973; Tenesa et al., 2007). We found that all derivations 
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are in some parts of heuristic nature, including mathematical gaps or unsound conclusions. 
Indeed, concerns over the validity of the formula and their derivations have already been 
raised by Sved (cf. Sved (2008), p. 185, and a manuscript published on Sved's personal home- 
page http://www.hcLnd.songenetics.com/PIFFLE/LinkageDisequilibrium.pdf). In the 
following, we will sketch some of the mathematical concerns unfolding in these derivations. 

In the manuscript mentioned above Sved reports a misunderstanding in the original 
derivation (Sved, 1971), in which eqn (2) is derived, stating that the recursion formula (2) 
should have been Qt = (l — jj) (1 — c) 2 Qt-i + ^(1 — c) 2 . But this would not lead to Sved's 
formula at equilibrium. It is further argued that a second misunderstanding seems to cancel 
out the first one leading to eqn (2) again, but some uncertainty about the correctness of 
the equations remains, as stated by Sved in the manuscript mentioned above. 

A second key step in this derivation is the equation Qt = ~E.(r T ) which finally leads 
to eqn (1) at equilibrium. To justify this equation, the following argumentation is used: 
Imagine, a gamete is sampled at random from the population. A second gamete with 
the same genotype at the first locus is sampled afterwards. The genes at the first locus 
are said to be identical by descent (IBD) per definition. Then, Sved uses the formula 
Pb + Pb- f° r the probability of homozygosity at the second locus, where ps 1 and pb 2 
are the corresponding allele frequencies. The expression p B + p 2 B is the unconditional 
probability of homozygosity, not taking into account the homozygosity at the first locus in 
LD, while the conditional probability is expected to be greater than p B + p B ■ 

Sved and Feldmann (1973) rediscuss this approach and propose a modified recursion 
formula which is Qt = (1 — ^j (1 — c) 2 Qt-i + jn, but the proof of E(r T ) = Qt is 
lacking. 

Finally, another approach is presented in Sved (2008, 2009) by combining the concepts 
of correlation of two loci and probability of IBD. The critical point in these derivations is 
that correlations are assumed to be additive. However, this assumption is only verified for 
the one-locus case, and a proof for the required two-locus case is still missing. 

Tenesa et al. (2007) provide a shorter derivation of Sved's formula using the equation 
E(r t+ i) = (1 — c)r t . Here, the left-hand side is a constant, whereas the right-hand side 
is a random variable. Additionally, Var(r) w ( ~ r ' - is used as a general expression 
for the sampling variance of an estimate of a correlation coefficient r with sample size n. 
In this context, it is not distinguished between the true underlying correlation p and the 
empirical correlation coefficient r. It is not stated either for which underlying distribution 

/-l 2\2 

this formula can be applied. According to Hotclling (1953), Var(r) ss - — ^-*— holds for a 
bivariate normal distribution. Note that the numerator is squared, whereas this is not the 
case in the formula used by Tenesa et al. (2007). It is unclear, whether and how the formula 
used by Tenesa et al. (2007) is related to the result of Hotclling (1953), since in the case 
of LD the underlying distribution is bivariate Bernoulli, and approximation by a bivariate 
normal distribution is questionable in this case. 

All points of critique mentioned so far stress the need for a clearer approach and an 
extensive empirical analysis of the existing formulae. 



1.3. Organization of the paper 

This paper is organized as follows: We first propose an alternative linear recursion formula 
for the expected LD in a finite population and analyze its validity in an extensive simulation 
study. The new formula is also compared to Sved's recursion formula, and the dependency 
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of the precision of both formulae on the constellation of allele frequencies is analyzed. 

We then consider the expected LD at equilibrium in the mathematical framework of the 
theory on discrete-time Markov chains. On the basis of a (linear) recursion formula, we 
derive a formula for the expected amount of LD at equilibrium, leading to a formula for 
the effective population size N e . First, the derivation is given under the assumption that 
the recursion formula is exact. We then analyze how the non-exactness of a linear recursion 
formula affects the result for the expected LD at equilibrium. 

In an application, we estimate effective population sizes for the human HapMap data 
(The International HapMap Consortium, 2003) using records of two populations. To illus- 
trate the impact of the allele frequency spectrum used, this is done for different sampling 
schemes based on minor allele frequencies. 

We finally discuss the practical implications of our findings. 



2. Methods and Results 

2.1. A new recursion formula 

2.1.1. Basic principles and assumptions: 

Hill and Robertson (1968) proposed r 2 as a measure of LD between a pair of loci. With 
two biallelic loci A and B with alleles A\, A2, Bi, B2 and frequencies pa x , Pa 2 > PB t , Pb 2 j we 
denote the frequencies of the genotypes A\B\, A1B2, A2B1, and A2B2 by in, X12, £21, and 
£22, respectively Then, 

2 _ (xiix 22 -X12X21) 2 (o\ 

PA 1 PA 2 PB 1 PB 2 

Note that if we consider the allelic states at the two loci as Bernoulli variables with param- 
eters pa x and pb ± , then r 2 is the square of the correlation coefficient of these two random 
variables. 

In the following, we consider a diploid population of finite size N at some arbitrary point 
T = to in time and two biallelic loci A and B as described above, with gamete frequencies 
x to :— (xt .ii,Xt ,i2,Xt .2i,Xt .22)- Assuming random mating and a constant recombination 
rate c over time, we can calculate the probabilities x£ :— (x' t n,x' t Yi,x' t 2 ii x t 22) f° r 
receiving the four different genotypes when producing an offspring gamete as 

x t ,u = x t Q ,ii ~ cD , x' to>12 = x to< i2 + cD , x' t0i2l =x to ,2i+cD 

and ^to.22 = ^0,22 - cDq, (4) 

with Dq :— Xt ,nXt ,22 — Xt ,i2 x t ,2i- For a detailed derivation we refer to Hcdrick (2011), p. 
528ff and the references therein. We are now interested in the expected squared correlation 
coefficient E Xt {r 2 +1 ) of the two random variables (the allelic states at the two loci) in 
T = to + I, given x to (and hence r 2 o ) from T = to. If we assume that the population 
size is constant, the population in T = to + 1 is formed by 2N gametes, and the absolute 
frequencies of the four types of gametes (n' n , n' l2 ,n 2 i, n'22) '■— 2-/Vx to+ i follow a multinomial 
distribution with parameters 2N and p = (x' to n , x' to i 2 ,x' to 21 , x' t(j 22 ). 
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2.1.2. Analytic expression for the expected LD in the next generation: 

Based on the above assumptions, the exact expected LD in T = to + 1 conditional on x to is 

given by: 

/ 9 \ -rn / \ 1 1 22 1221/ 1 /\ 

where E Xt denotes the expectation with respect to the multinomial distribution with pa- 
rameters 2JV and p — x! t as described above. Analytical treatment of this expectation (i.e., 
expressing it in terms of the probabilities x' toi j) does not seem to be feasible for general 
JV. The open question is now how to deal with the complex formula. Even if one tried to 
approximate the expectation of the ratio by the ratio of expectations (cf. e.g. Hill (1977); 
Ohta and Kimura (1971)), the result would still depend on x to in a very complex manner. 
Therefore, it is reasonable to work with an approximation of this expression, involving only 
r 2 on the right-hand side of eqn (5). 



2.1.3. The alternative recursion formula for the LD: 

According to Sved's approach and based on the assumptions of the previous sections, we 
propose the following form of an approximate recursion formula for the expected LD in the 
population, given the gamete frequencies x tf) in T = to: 

Ex t0 « +1 ) = arl + b = ar\x ta ) + b, (6) 

where a and b are functions of c and TV. Note that r? is in fact a function of x to , which we 
indicate sometimes by writing r 2 (x to ). We further choose 

-(1- C ) 2 (l-^) and b= W ^- c (7) 

Note that this choice differs from Sved's recursion formula only in the value of b (cf. eqn 
(2)) and that we will justify this choice in the subsequent sections. The coefficients a and 
b were determined heuristically followed by a systematic validation. 



2.2. Simulation study to analyze the performance of the new recursion formula 

2.2.1. Simulation set-up: 

The general idea of the simulation study is the following: For a given combination (JV, c, x to ) 
in T — to, we randomly draw A^ amp i c samples of 2N gametes according to the above 
multinomial distribution with parameters 27V and p = x' tQ . For each of these samples, 
x io+ i and the allele frequencies are obtained as empirical gamete and allele frequencies in 
T = to + 1, and r| L + i is calculated according to eqn (3). Then, E Xt (r? +1 ) is approximated 

by averaging over the N Bamp] _ e values of r t 2 o+1 . Given all tuples (JV, c,r 2 o ,E Xto (r 2 o+1 )), we 
can systematically analyze the fit of eqn (6) in combination with eqn (7), as described below. 
The simulation was done for all combinations of JV, c and x to , where 

JVe {2 2 ,2 3 ,...,2 14 } 
c G {0, 0.001, 0.002, . . . , 0.01, 0.02, . . . , 0.5} 
a^n G {0,0.05,0.1,...,!} 
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Xt ,i2 e {0, 0.05, 0.1, . . . , (1 - x t0t u)} , for given x t(J ,u 

Xt ,2i S {0, 0.05, 0.1, . . . , (1 - a; i0) n - x toA2 )} , for given x to . n , and x toA2 . 



Note that Xt 0t2 2 is determined by Xt , 22 = 1 — x t Q ,n — Xt ,V2 — x t ,2i- For each parameter 
constellation, the number of realizations N samp \ c was chosen dynamically, as described be- 
low. Further note that 0.05 was chosen as grid-length for x to , although each component 
of x to can theoretically take only values in {0, ^y, ^, . . . , l}. For N < 32, we simulated 
according to this real grid, but got almost identical results. For large N, the computational 
costs are too high to simulate from the true grid (0, ^j*, gjf' ' • * > •"•}• 

Parameter constellations causing at least one allele frequency to be zero were excluded 
from the analyses, because in this case r\ cannot be calculated. 



2.2.2. Measures of the goodness of fit: 

We propose the following characteristic as a measure of goodness of fit of eqn (6): 



Ex to « +1 )-arf o 



1, 



for given a and b. If equalities (6) and (7) were exact, we would observe F = for all 
possible parameter constellations. In the appendix we show that F is closely related to the 
relative error of the expected LD at equilibrium due to the non-exactness of the recursion 
formula. Note that F is especially sensitive for a misspecification of b in eqn (7). Since 
Ext (?t +i) is unknown, we use 

ExfT^T-n ) - arf 

F - b ^ 

where E Xt (rf +1 ) is obtained from the simulation study. 

Dynamic sampling: 
In the simulation process described above, -/V samp i c was chosen dynamically such that the 
standard deviation of F was approximately constant over all combinations (N, c, Xt ): It is 



s.d.(F) =s.d. , 

1 b 



Ex t0 « +1 )-< i 
b 

s.d. (B^5f +1 )) 



b 

2 



= s.d.« +1 ) 
• y JV sam pi c 

The right-hand side is constant if 



]\T [ ■V tp + 1/ 1 



/s.d.fr?. 

^sample 

with any constant d > 0. To obtain s.d.(r? +1 ), we performed a preliminary simulation 
study according to the same simulation set-up, but with a constant sample size of 10, 000, 
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and the empirical standard deviation of rf +1 was calculated for all (N, c, x to )-constcllations. 
The value of d was chosen such that the maximal value of iV samp i e equaled 5 • 10 6 . This led 
to an average sample size of 462, 800 with a median of 15, 000. 

All statistical analyses were performed with R software (R Development Core Team, 
2012). The R-package "multicore" (Urbanek, 2011) was used to parallelize the simulation. 



2.2.3. Results of the simulation study: 

We found that F was centered around (Figure 1). When all values of F below the 
2.5% and above the 97.5% quantiles were excluded, F ranged between —1 and 1 indicating 
that the recursion formula fits the simulated data reasonably well. Values of F below the 
2.5% quantiles and above the 97.5% quantiles were found to be generated by parameter 
constellations for which 

P : = %t ,l~l- x t ,\2Xt ,2\Xt ,22 

was close to zero, i.e. for constellations in which at least one gamete frequency in T = to 
was close to zero (results not shown). 
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Fig. 1. Density function of F. The left plot shows the density for all obtained values of F. To obtain 
the right density plot, values of F below the 2.5% quantile and above the 97.5% quantile of the 
distribution of F were excluded. 



We used boxplots to display F for different parameter constellations. Boxplots were 
created separately for different values of N, c, P and S :— x to> n+Xt ,i2- Note that S equals 
the allele frequency pa 1 sX the first locus and for symmetry arguments is representative for 
all other allele frequencies. Values of P (and S) were subdivided into 20 (and 15) equidistant 
bins, respectively. Outliers (i.e. values which lie beyond the extremes of the whiskers) are 
not displayed in any of the plots. 

From Figure 2 we can see that the proposed recursion formula fits the data reasonably 
well, both for varying N and c. The bias as a function of N is almost constant, and it 
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decays with increasing c. The goodness of fit depends heavily on P and S: F is larger and 
more variable for small P and for extreme S, but F still ranges between —1.5 and 1.5 in all 
considered boxplots. 



^9B E3 f"3 ("3 E3 E3 ^ C~^ C"~i E3 S 



4 8 16 64 256 1024 4096 16384 
N 



0.01 0.1 0.2 0.3 0.4 0.5 



I " i LdJ ui-i uy i - i LtD n i n Uu ^ cfa EiJ □!□ tip tip dp rjn rjn 



THSS^w*mSHr 



0.001 0.002 0.003 0.004 

P = X 11 X 12 X 21 X 2 2 



0.2 



0.4 0.6 

S = x 11 +x ]2 



Fig. 2. Boxplots of F, separately for different bins of N,c,P :— £1 1x120:21 £22 and 5 := xu + x\%, 
based on the new recursion formula. Here, P is the product of gamete frequencies and S is the 
allele frequency of the first allele at the first locus. F was calculated according to the new recursion 
formula. Outliers (i.e. values which lie beyond the extremes of the whiskers) are not shown. 



Figure 3 shows that Sved's recursion formula does not fit the simulated data as well as 
the new formula, especially for c > 0.01 and for N < 30. This insufficient fit for c > 0.01 
also pertains to the boxplots for varying P and S (as F is averaged over all constellations 
of (N, c, x to ) for a fixed bin of P and S, respectively). For c < 0.01, there are only marginal 
differences between F based on Sved's recursion formula and the new recursion formula. 

Contourplots were drawn for the empirical mean of F 2 , with the mean calculated using 
all values of F obtained for a given combination of values on the vertical and horizontal 
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Fig. 3. Boxplots of F, separately for different bins of N,c,P :— £113:120:210:22 and S ■■— am + W12, 
based on Sved's recursion formula. Here, P is the product of gamete frequencies and S is the allele 
frequency of the first allele at the first locus. F was calculated according to Sved's recursion formula. 
Outliers (i.e. values which lie beyond the extremes of the whiskers) are not shown. 



axis of the contourplot. For example, in the contourplot with axes (N, c) (cf. Figure 4) F 2 
values were averaged over all possible combinations of (xt 0> u, £t ,i2, %t , 21; 2^0.22) f° r a fixed 
combination (N, c). For a clearer representation of the contourplots, we excluded all values 
of F below the 2.5% and above the 97.5% quantiles beforehand. 

The contourplots of Figure 4 show that F 2 depends only slightly on N and c, emphasizing 
the adequate fit of the new recursion formula. Figure 5 approves the previous results on the 
dependency of the goodness of fit on gamete and allele frequencies in T = t : The quality 
of the fit is reduced for S < 0.2 and S > 0.8. The same can be observed for P < 0.0004 as 
well as for AMAF := \min(pA 1 ,PA 2 ) — mm (PBi>Ps 2 )| < 0.2 (results not shown). The term 
AMAF describes the absolute difference in minor allele frequencies (MAFs) of both loci, 
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6 8 10 
log 2 (N) 



Fig. 4. Contourplot of the average values of F 2 . For a given combination of (log 2 (JV), c), the values 
of F 2 were averaged over all possible combinations of x io . Contourplots were created after excluding 
the extreme 2.5% quantiles of F. 



With p A2 = 1 - p Al = X to ,21 + X to ,22 and PB 2 = 1 - PB X = X U) A2 + X to ,22- 

Note that similar structures in the contourplots with respect to the dependencies on the 
allele frequencies can be observed for the goodness of fit of Sved's recursion formula (results 
not shown). 

Overall, the results confirm that an exact recursion formula must depend on the gamete 
frequencies. However, this would lead to complex formulae, especially for the state of 
equilibrium, which we will discuss in the appendix. 



2.2.4. Comparison of slope and intercept of the recursion formulae to empirical values: 
We compared the new recursion formula (eqn (6) in combination with eqn (7)) to the 
recursion formula of Sved (eqn (2)) by plotting the slope a against c for a given N. The 
same was done for the intercept b. Given N and c, we also fitted a linear regression model 
to the tuples (rf ,E Xt (r? +1 )) from the simulation study and added the points (c,a) (and 
(c, b), respectively) to the plots, where a and b were the estimated slope and intercept from 
the regression model. The results are shown in Figure 6. 

The slopes a for the two recursion formulae are identical and coincide well with the 
empirical ones, with a better agreement for larger N (Figure 6). The intercepts, though, 
differ greatly remarkable between the two approaches, especially for c > 0.1 and for small 
N. The intercepts according to Sved's formula are not in agreement with the empirical ones 
for small N and c > 0.1. This is also reflected in the large values of F for increasing c (cf. 
Figure 3). Differences between the intercepts according to Sved's recursion formula and the 
new recursion formula become less pronounced for increasing N and for decreasing c. 

We tried to improve the fit of a and b based on Figure 6, especially for small N, by 
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Fig. 5. Contourplot of the average values of F 2 . Left plots: For a given combination of (S, log 2 (iV)), 
the values of F 2 were averaged over all possible values of c and x to (upper plot). The lower plot 
illustrates the average value of F 2 as a function of P for log 2 (iV) = 8. Right plots: For a given com- 
bination of (S, c), the values of F 2 were averaged over all possible values of log 2 (7V) and x to (upper 
plot). The lower plot illustrates the average value of F as a function of P for c = 0.2. Contourplots 
were created after excluding the extreme 2.5% quantiles of P. 



using different formula for a and b in eqn (7), e.g. a = (l — ^) (l — c (l~ Jn)) an d 
b = nAT li — : ■ However this did not lead to a significant improvement in terms of F (results 



2JV+1- 



not shown). Since the true relation between E(r? +1 ) and Ttn+i * s n0 ^ un ear, optimizing a 
(weighted) average of F is relevant. Thus, we would recommend to use Figures 2 and 3 as 
a basis for the assessment of adequacy. 



2.3. The expected LD at equilibrium based on the theory of discrete Markov chains 
2.3.1. Assuming that the recursion formula is exact: 

In previous studies, "equilibrium" was defined as the point in time at which the expected 
LD of the next generation equals the LD of the previous one (see e.g. Sved (1971); Tenesa 
et al. (2007)). Using this definition and assuming a linear recursion formula with coefficients 
a and b (eqn (6)), the expected LD at equilibrium equals j^-. 

Two major problems arise from this definition: Firstly, it is not clear whether this 
equilibrium will ever be achieved. Secondly, one cannot infer from this definition how the 
formula for the expected LD at equilibrium is affected if the recursion formula is not exact 
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Fig. 6. Comparison of slope and intercepts of Sved's and the new recursion formula to simulated 
values. Left plots: The slope a (identical for Sved's and the new recursion formula) is plotted against 
c for different values of N. The black dots are the slopes empirically obtained via linear regression. 
Hereby, the average LD values obtained in T — to + 1 were regressed against r 2 tQ . Right plots: The 
intercepts b of the recursion formulae are plotted against c for different values of N. Blue (red) lines 
indicate Sved's (the new) recursion formula. The black dots are the slopes empirically obtained via 
linear regression. 



but only approximate. 

To overcome these problems, a mathematically deeper definition of equilibrium can 
be given based on the theory of Markov chains, since the sequence of gamete frequencies 
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xr, T = to, to + 1,... forms a homogeneous Markov chain with transition probabilities 
given by the multinomial distribution of the number of gametes of the four types in each 
generation. In this framework equilibrium is defined as the steady-state of the considered 
Markov chain, and the expected LD at equilibrium can be calculated as expectation of r 2 
under the steady-state distribution. 

Under the assumption that the underlying recursion formula is exact, the expected LD 
at equilibrium based on this approach turns out to be 

R := E(rl) = -*- (8) 

1 — a 

for |a| < 1, in concordance with the above formula. A detailed derivation based on the 
Markov chain theory is given in the appendix. Note that despite the apparent coincidence 
with formulae currently used in practice, usually no reference to the Markov chain theory is 
made. The same Markov chain model for the evolution of gamete frequencies has been used 
by Karlin and McGregor (1968), in a study on the ascertainment of fixation probabilities. 

Furthermore, the Markov chain theory has the advantage that it also allows the calcula- 
tion of the expected LD at equilibrium in the case of a non-exact recursion formula. We will 
come back to this issue in the next section, in which we will analyze how this non-exactness 
affects the formula for the expected LD at equilibrium. 

Using a and b of eqn (7) in eqn (8) yields the following formula for the expected LD at 
equilibrium: 

R= i-{i-cni-&) (9) 

i 



(2N - 1 - c) - (2AT - 1 - c)(l - c) 2 (l - gk) 
This formula differs from Sved's formula (eqn (1)). Solving eqn (9) for N yields 

N = 2(8cR l _ 4c2R) (Y + y/-A{8cR - 4c 2 i?)(-i? + cR + c 2 R - c^R) + F 2 ) , (10) 

with Y := 2 — 2R + 8cR, — 2c 3 R, taking into account that N cannot be negative. 

To compare the formulae for the expected LD at equilibrium according to eqns (1) and 
(9), we plotted E(r 2 c ) based on both recursion formulae against c for N G {4, 16, 64, 256}. 
Results are shown in Figure 7. Only small differences between both recursion formulae are 
observed for large N in Figure 7, whereas the difference gets more pronounced, when TV 
is small (cf. Figure 6 where similar effects can be realized). The new recursion formula 
predicts higher values for the expected LD at equilibrium for small values of N. 

Real populations usually do not fulfill the implicit assumptions of ideal populations. It 
is therefore of interest to calculate the effective population size N e based on the average 
LD-value observed from the population for a given value of c. By definition, N e is the size 
of an ideal population at equilibrium with the same structure of LD as the population under 
consideration. In practice, N e is obtained from the right-hand side of eqn (10), using the 
average LD-value observed from the data as value of R. 



2.3.2. Non-exactness of the recursion formulae: 

As indicated above, another problem which has not been discussed in the literature so far 

arises from the non-exactness of the recursion formulae. This problem can also be overcome 
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Fig. 7. Comparison of Sved's formula and the new formula for the expected LD R at equilibrium: R 
is plotted against c for different values of N. Red (black) lines show R according to the new (Sved's) 
formula. The dots indicate the values of R for c = 0, 0.005, . . . , 0.05. 



by the Markov chain theory. 

In the case of a non-exact recursion formula, the Markov chain theory allows to trans- 
fer an error term in the recursion formula to the state of equilibrium. In the appendix, 
we provide the corresponding calculations and show that in case of a non-exact recursion 
formula 



R e 



E M (dJ 



i + ir e 
1-a 
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for \a\ < 1, where 7r* is the stationary distribution of the considered Markov chain and 

e = (e(si),...,e(s z )) 
with e(si) := E Si (rf 0+1 ) - ar 2 { Si ) - b 

is the residual term of the recursion formula depending on the different possible values Si of 
x to . Then, the term ^ — 1 = '^-A measures the relative influence of 7r*e on the expected 
LD. 

To analyze the effect of the non-exactness, we calculated 7L ^ for different (N, c)-combinations. 
For details, we refer to the appendix. The results are listed in Table 1 for N = 4, 8, 16 and 
c = 0.001,0.01,0.05,0.1,0.2,0.3. They illustrate that the error-term can lead to a deviance 
of expected LD up to 25% suggesting that the effect of non-exactness may be non-negligible. 
These analyses were restricted to small values of N, since the calculation times increase 
rapidly with N. 



Table 1. 


Values of ^ 


for different (iV,c)- 


-combinations. 


c 


N = 4 


N = 8, 


N= 16 


0.001 


-0.264 


-0.199 


-0.067 


0.01 


-0.255 


-0.165 


-0.023 


0.05 


-0.176 


-0.088 


0.096 


0.1 


-0.116 


-0.018 


0.195 


0.2 


-0.074 


0.053 


0.297 


0.3 


-0.025 


0.097 


0.324 



Absorbing states were excluded beforehand, and -k* was 
rescaled so that its entries summed up to 1 afterwards. 



To get a first impression on the development of 1L ^ for TV > 16, we also plotted the 
negative mean and the maximum value of e divided by b, given by the terms Si = — - ^ 
and S2 = max ^ ' ei ' , for more values of N and c (Figure 8). Note that this does not incorporate 
the stationary distribution ir* and that e-values in these plots were based on the simulation 
study using the grid of (in, xi2,X2i,X22)-values with fixed grid-distance of 0.05, which is 
not as fine as the "true" grid if N > 20 so that results at this point have to be taken 
with caution. More detailed analyses and a comprehensive simulation study are needed to 
underpin the quantitative results on the influence of e on the expected LD at equilibrium. 



2.4. Application based on the HapMap data 

As an application of the equilibrium-formula based on the proposed recursion formula, we 
estimated N e from LD using human data from the HapMap project (The International 
HapMap Consortium, 2003, 2007) and applying eqn (10) as described in the previous sec- 
tions. We also investigated, how the distribution of MAFs of single nucleotide polymor- 
phisms (SNPs), used to estimate the expected LD in the population, influences the average 
LD values and hence also the estimates of N„ . 
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Fig. 8. Values of Si = '*£ ' (upper left plot) and S 2 = max ' |E ' (upper right plot) for different 
values of N and c, calculated on the basis of the simulation study. Note that the simulation study was 
based on a grid of (a;t ,ii,a;t D ,i2,a:to,2i,a;to,22)-values with fixed grid-length 0.05. Absorbing states 
were excluded beforehand. The left (right) lower plot illustrates the values of Si (S 2 ) as a function of 
log 2 (JV) fore = 0.2. 



2.4.1. The HapMap data set: 



The HapMap data set comprises 270 samples from four populations. In this study, we 
consider two different populations, the Yoruba in Ibadan, Nigeria (YRI) and Utah resi- 
dents with Northern and Western European ancestry from the CEPH collection (CEU). 
For each population, the data comprises 30 trios of individuals. The data are available 
from http://hapmap.ncbi.nlm.nih.gov/downloads/index.html.en. For both popula- 
tions, we used allele frequencies from phases II and III (release #27) as well as LD data 
from phases I, II and III (release #27) for SNPs lying on the 22 autosomes, and a correspond- 
ing genetic map (from phase II, estimated from phased haplotypes in release #22 (NCBI 
36)) (The International HapMap Consortium, 2007). LD values were available for markers 
up to 200kb apart. For autosome 22, e.g., there were « 38,000(34,000) SNPs occurring in 
10, 133, 060 (8, 130, 042) LD values for the YRI (CEU) population, for which the genetic dis- 
tance between the corresponding SNP pairs was available. Summing over the 22 autosomes, 
there were in total « 2, 868, 000 (2, 560, 000) SNPs occurring in 701, 820, 000 (563, 239, 000) 
LD values for the YRI (CEU) population. 
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2.4.2. Estimation of N e for the YRI and CEU population: 

We estimated N e separately for the YRI and the CEU population for each of the 22 auto- 
somes, using eqn (10) for the expected LD at equilibrium, with R = E(r^ ), estimated as 
average LD value obtained from the data for given c, and replacing N with N e . 

Following Weir and Hill (1980) we adjusted for the chromosome sample size n by sub- 
tracting - from the sample-based LD values. This is necessary, since even in the case of 
independent loci E(r 2 ) = -. It has been shown by Bishop et al. (1975), p. 382, that nr 2 
has an approximate xi distribution for a bivariate Bernoulli distribution with independent 
components, and hence E(r 2 ) = - in this case. With this adjustment, 



Ne = 2{ % ck _ Ac2R) I Y + V -^ 8CR - Ac2R )(- R + cR + C2R - C3R ) + Y2 ) > ("J 

with Y := 2 - 2R + 8cR - 2c 3 R and R = Elr 2 ) - I. 

For the HapMap data of the YRI and CEU population, n = 120, since sequences of 
30 trios for each population were available, comprising 4 independent parental gametes for 
each trio. Estimates of N e and average LD values were obtained for different bins of the 
recombination rate c. To classify the pairs of SNPs to the bins, c was approximated by the 
genetic distance in Morgan. Note that this approach is admissible for small distances. For 
each autosome, 100 equidistant bins of c ranging from to the maximal genetic distance 
occurring in the data were chosen. For each bin of c, the average r 2 value minus y|g was 
calculated and used in eqn (11) to obtain an estimate of N e . The estimated N e was plotted 
against log 10 (^) (Figure 9). Additionally, a plot of the adjusted average LD value against 
c was generated (Figure 9). Results were plotted for all bins of c containing at least 1,000 
LD values. 

The decay of LD with genetic distance for the YRI and CEU population can be seen in 
the upper plots of Figure 9, estimates of N e are displayed in the lower plots. Note that, 
since N e of human populations is large (N e > 1,000), eqns (9) and (1) basically lead to 
the same estimates (results not shown). N e is smaller for the CEU population (lower right 
plot of Figure 9) and increasing from ss 5,000 to w 10,000 for £- ranging from 1,500 to 
200, whereas N e for the YRI population is decreasing for these values of ^-. Hayes ct al. 
(2003) argue that the N e estimate based on Sved's formula (1) for a fixed c corresponds to 
an estimated N e approximately ^- generations ago, if the population grows linearly over 
time. Having in mind that the new formula (11) and the one based on Sved's formula differ 
only marginally for large N (cf. Figure 7) and that both the YRI and the CEU populations 
are large, we will also apply this concept in the following. Assuming a generation interval 
of 25 years, the above time frame encompasses 37, 500 to 5, 000 years ago, and we find 
N e w 15,000(5,800) for the YRI (CEU) population « 1,000 generations (= 25,000 years) 
ago, as well as N e w 20, 500 (10, 000) for YRI (CEU) « 8, 000 generations (= 200, 000 years) 
ago (cf. Figure 9). 

For values of c with log 10 (^) < 1.75, a high variability of N e values can be observed 
(Figure 9). We hypothesize that the corresponding values of LD observed from the data 
are in the order of magnitude one would expect if loci were independent, in which case it 
would not make sense to estimate N e . Critical values for r 2 could theoretically be derived, 
if the distribution of nr 2 was known for dependent loci (as it is the case for independent 
loci). 
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Fig. 9. LD and estimates of N e for the YRI and CEU population based on the new recursion formula 
displayed for the 22 autosomes. The upper plots show the decay of LD for varying c, estimated 
from SNPs on different autosomes. In the lower plots, the corresponding estimates of N e are plotted 
against log 10 (~). The left (right) plots are for the YRI (CEU) population. N e estimates based on a 
given value of c correspond to the point in time "^ generations ago" (Hayes et al., 2003). 



2.4.3. Influence of distribution of MAFs on the N e -estimates: 

As the detailed analysis of F indicated that the expected LD also depends on the distribution 
of allele frequencies, it is important to investigate how the underlying distribution of MAFs 
affects the estimation of N e . 

In commercial SNP array construction for animal breeding purposes, the use of SNPs 
with uniform MAF distribution is common practice. A uniform distribution of MAFs is 
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in general not pursued in human genetics, but may still occur, e.g. in studies using phase 
I data of the HapMap project (The International HapMap Consortium, 2003), where an 
ascertainment bias can be observed (Clark et al., 2005; Nielsen, 2000; Nielsen et al., 2004; 
Pe'er et al., 2006). 

An enforced uniform distribution of MAFs may introduce a systematic and substantial 
downward bias in N e estimates, especially for historical effective population sizes, which we 
demonstrated with the YRI and the CEU population using data of autosome 22: 

Histograms for the MAF values for all SNPs occurring in SNP pairs for which LD values 
and the genetic distance were available (Figure 10) show that in both populations low 
MAFs are overrepresented. For each population we sampled 10, 000 SNP positions out of 
the « 38,000(36,000) SNPs available for YRI (CEU) according to two different scenarios: 

(1) The 10,000 positions were sampled randomly, i.e. from the true skewed distribution 
of MAFs (cf. Figure 10). 

(2) MAFs were divided into 10 equidistant bins between and 0.5. Then, 1,000 SNPs 
from each bin were sampled to mimic a uniform distribution of MAFs, and all those 
LD values of pairs of SNPs were kept for which the positions of both SNPs were among 
the 10,000 sampled positions. 
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Fig. 10. Histograms of the distribution of MAFs for all SNPs occurring in the LD data of autosome 
22. The left (right) plot shows the histogram for the YRI (CEU) population. 



For each scenario, N e was estimated for different bins of c. We chose 24 equidistant 
bins of c ranging from to sw 0.002 and 25 equidistant bins of c ranging from sw 0.002 
to ps 0.02. The whole sampling process was repeated 100 times. This resulted in 100 
estimated N e values per scenario and per bin of c, for which boxplots were created to 
display the distributions of N e for both scenarios. Boxplots were only created for bins in 
which on average (averaged with respect to the 100 replicates) at least 1,000 LD-values 
were available. 
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Fig. 1 1 . Estimates of N e for the YRI and CEU population for different distributions of MAFs. The left 
(right) plot is for the YRI (CEU) population. Each boxplot represents the distribution of N e estimates 
for a given bin of distance c between SNPs in Morgan. N e was estimated based on the new recursion 
formula. Estimates of N e were obtained for two different scenarios: (1 ) SNP positions were randomly 
sampled, i.e. from a skewed distribution of MAFs. (2) SNP positions were randomly sampled, so that 
the distribution of corresponding MAFs was uniform. The sampling process was replicated 100 times 
and N e was estimated for each replicate, resulting in 100 JV e -estimates per scenario and per bin of 
c. Blue boxplots represent the distribution of iV e -estimates for scenario (1), black boxplots represent 
the distribution of N e -estimates for scenario (2). Only SNPs on chromosome 22 were considered. 
Note that the scale of the x-axis is not linear. N e estimates based on a given value of c correspond 
to the point in time "j- generations ago" (Hayes et al., 2003). 



Figure 1 1 illustrates the influence of the distribution of MAFs of SNP pairs used for LD 
estimation on the N e estimates in the two populations, respectively. N e was estimated for 
different values of ^- , which can be interpreted as the "number of generations ago" in case 
of Sved's formula (Hayes et al., 2003), as described above. The plots demonstrate that the 
N e estimates using a skewed MAF distribution are up to 30% larger than the ones using a 
uniform MAF distribution for large values of £-. For example, for £- = 500, the estimated 
A^ e ranges from w 9, 000 (5, 400) using a skewed MAF distribution to sw 12, 100 (6, 900) using 
a uniform MAF distribution and the YRI (CEU) population. 



2.4.4. Comparison of the HapMap estimates with recent results of other studies: 
Tenesa et al. (2007) used the phase I HapMap data to estimate N e in the YRI and CEU 
population based on « 1,000,000 SNPs from 23 chromosomes. The intermarker distance 
was in the range of 5kb to lOOkb for all SNP pairs. Using only SNPs on autosome 22 
and estimating recombination rates from a nonlinear model, Tenesa et al. (2007) estimated 
N e = 3,246 for the YRI and N e — 1,459 for the CEU population. Overall, their estimates 
appeared to be much lower than the usually quoted value of 10, 000 (Harding et al., 1997; 
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Takahata, 1993). Using a model-free method to estimate recombination rates however 
changed the estimate of N e between +33% and —45%. Results for the YRI population 
indicated an ancestral population size of sw 7, 000 followed by expansion in the last 20, 000 
years (« 1,000 generations), whereas results for the CEU data supported recent dramatic 
population growth from an ancestral population size of ~ 2, 500. 

Our study, based on all 22 autosomes, also indicates a population growth for the CEU 
population (cf . Figure 9) , whereas no growth can be observed for the YRI population. One 
possible reason for the discrepancy of the results may be that Tenesa et al. (2007) used a 
smaller SNP set and different recombination rates, due to the different methods of obtaining 
these rates. More importantly, the results were based on a release of phase I, whereas we 
used phase II data of the HapMap project. Additionally, Tenesa et al. (2007) excluded all 
SNPs with MAF < 0.05 for the LD estimation, whereas no filtering was performed in the 
present study. 

Tenesa et al. (2007) also analyzed the effect of a possible ascertainment bias in the 
HapMap phase I data (Clark et al., 2005; Nielsen, 2000; Nielsen et al., 2004; Pe'er et al., 
2006) by simulating SNPs with complete ascertainment and simulating SNPs according to 
a uniform distribution of MAFs (still excluding SNPs with MAF < 0.05). They found that 
N e estimates were biased downwards by 18% in the second scenario (which is supposed 
to mimick the MAF distribution of the HapMap phase I data) and concluded that this 
was also true for their estimated N e of the HapMap population. As opposed to this, in 
the present study a skewed distribution of MAFs for SNPs occurring in the LD data was 
observed, as illustrated in Figure 10, which is due to the ~ 2.1 million additional SNPs of 
phase II data compared to the phase I data, which comprised only 1.3 million SNPs (The 
International HapMap Consortium, 2007). However, the simulation results of Tenesa et al. 
(2007) qualitatively confirm our results of the previous section with respect to the influence 
of the MAF distribution on the A e -estimates. 

Park (2011) used HapMap phase III data to estimate N e of the current human pop- 
ulation based on two different methods, one using the deviation from linkage equilibrium 
(LE), the other based on the deviation from Hardy- Weinberg Equilibrium (HWE). For the 
YRI population, estimates fluctuated between 1, 275 and 7, 729, depending on the method, 
whereas estimates for the CEU population ranged between 1,331 and 10,437, illustrating 
again the great variability of results. Park (2011) argued that the HWE-based method 
presented the N e of the current generation, whereas the LE-based method reflected values 
of "current and recent" generations. By considering the ratio of HWE- and LE-based esti- 
mates it was found that both populations experienced a recent population growth, which 
was more distinct for the CEU population. 

McEvoy ct al. (2011) also estimated N e based on the HapMap phase III data set using 
Sved's formula and the concept of Hayes ct al. (2003). It was found that the CEU population 
experienced a population growth from N e ~ 5,000 to N e « 11,000 between 800 and 240 
generations ago, whereas N e of the YRI population stayed fairly constant during this time. 
To decrease a possible ascertainment bias, McEvoy ct al. (2011) only used SNPs that were 
segregating in all populations. 

Our estimates of N e are highly variable in size for recent points in time (< 50 generations 
ago, corresponding to < 1, 250 years ago). A similar variability is reported in Tenesa et al. 
(2007), whereas no results are presented in McEvoy et al. (2011) for these points in time. 

In summary, our results agree reasonably well with previously reported findings. Similar 
to the findings of McEvoy et al. (2011), we found the YRI population to be ~ 2.5 times as 
large as the CEU population 1, 000 generations (25, 000 years) ago, while the effective sizes 
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of the two populations converge when considering more recent points in time. The observed 
increase of the European population between 15, 000 to 10, 000 years before present in the 
so-called neolithic expansion is in agreement with archaeological findings and coincides well 
with findings from independent sources, such as the estimations of Fu et al. (2012) based 
on mitochondrial genomes. However, it has been shown that the margin of fluctuation is 
large and that results should always be seen in relation to other existing studies. 



2.4.5. Limitations of the HapMap study: 

As indicated in one of the previous sections, results of the HapMap application have to be 
taken with caution, since the underlying recursion formula is not exact, contrary to what 
is assumed in the derivation of formula (8), which underlies formula (10). Complications 
arising from the non-exactness have neither been considered in previous studies so that 
our results are comparable to the results of other studies from this perspective. The non- 
exactness also pertains to the apparent dependency of the development of LD on allele 
frequencies, for which current recursion formulae do not account. All findings therefore have 
to be considered against the background of the implicit assumptions underlying eqn (10). 
Furthermore, we have indicated that the interpretation of different iV e -values belonging to 
different points in the past according to Hayes et al. (2003) had originally been derived for 
Sved's formula and under the assumption that the population grows linearly over time. In 
our case, considering sufficiently large populations, formula (10) differs only marginally from 
Sved's formula, so that it is reasonable to use the same interpretation, but this approach 
may not be valid in case of small populations. 

3. Discussion 

3. 1 . The influence of SNP array designs on N e -estimates 

We showed in the simulation study as well as in the application to the HapMap data, that 
allele frequencies have a strong influence on the performance of the recursion formula and on 
the estimation of N e . While the true distribution of MAFs in practical applications (e.g., 
in sequencing studies) is usually skewed with a substantial excess of small MAF values, 
commercial SNP arrays are often constructed such that the MAF distribution is uniform, 
i.e., alleles with extreme MAFs are systematically underrepresented (see e.g. Matukumalli 
et al. (2009) for the construction of a density SNP genotyping array for cattle). Hence, 
using LD values based on such an SNP array can have a major impact on estimates of N e 
and may result in biased estimates of N e compared to a situation in which the distribution 
of MAFs is not uniform. A similar bias may appear if an SNP array is constructed to reflect 
the allele frequency spectrum in one population but then is used to estimate N e in other 
populations. 



3.2. Analytic expression vs. approximate recursion formula 

The proposed recursion formula for E Xt (r 2 +1 ) is still not completely unbiased, which can 

be seen in the boxplots of Figure 2. One possibility to reduce the bias is to use b = (1 + m)b 
instead of b where m is chosen such that 



F-m = & E Xto (r t 2 o+1 ) = ar t 2 o +&. 
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The bias is in fact a function of the gamete frequencies, which can be seen when the 
upper plots of Figure 2 are created separately for different bins of P and S (results not 
shown) . This leads back to the problem that an exact recursion formula will depend on the 
frequencies as well. 

Even if it was possible to derive an exact recursion for a specific pair of loci with given 
allele frequencies, many pairs of loci are used to estimate the expected LD, and one would 
have to account not only for the allele frequencies of a single pair of loci but for the whole 
distribution of underlying frequencies, which does not seem to be feasible. 



3.3. Obtaining the expected LD at equilibrium directly 

One general way of obtaining the expected LD at equilibrium, without using any recursion 
formula, is to consider the matrix P of transition probabilities of the Markov chain and 
to calculate the limit of P™ for n — > oo to obtain the stationary distribution of gamete 
frequencies. From this, the expected LD at equilibrium can be calculated directly. However, 
a problem with this approach is that the size of P is ( 2 2J ^ 3 ) x C 2jy 3 ) (there are ( 2 2A ^ 3 ) 
possible states of the Markov chain (Karlin and McGregor, 1968)), which makes numerical 
calculation impossible even for moderately high N. We have already encountered this 
problem when analyzing the term 7L ^ relating to the non-exactness of the recursion formula. 
As an alternative, one could also simulate the Markov chain of gamete frequencies di- 
rectly (instead of calculating P™ for n — ¥ oo) and determine the stationary distribution 
based on the realizations of the Markov chain. This could be done for different values of N 
and c, and the expected LD could be calculated based on the empirically obtained station- 
ary distribution. Afterwards, the expected LD could be expressed as a function of N and c 
which then could be used for the estimation of N e . This approach is left for future work. 



3.4. Consequences of non-exact recursion formula 

Previous studies are based on the implicit assumption that the underlying recursion formula 
is exact and the formula for the expected LD at equilibrium does not incorporate an error- 
term of the recursion formula. For N < 16, we showed that the error-term in the recursion 
formula can lead to a non-negligible deviance of expected LD at equilibrium. These analyses 
were restricted to small values of N due to the limited calculating capacity and only illustrate 
the effect qualitatively. It might be that the error-term becomes negligible for increasing N 
(and small values of c) so that results from previous studies remain reliable. The critical 
question remains, how reliable estimates of N e are if they are based on a non-exact recursion 
formula, and further research is needed in this field. 



3.5. Alternative approaches in the literature 

In the literature, there are several other references with alternative approaches to derive 
formulae for N e based on LD. Hayes et al. (2003), e.g., state that N e can be estimated based 
on the chromosome segment homozygosity (CSH) by using the relation CSH = 4Af 1 e+1 , 
which is the same formula one obtains based on Sved's recursion from eqn (2) and in their 
framework, the estimate of N e corresponds to the point in time "^ generations ago", as 
described earlier. However, in the course of their derivation it is assumed that the two 
considered loci behave independently, which is equivalent to Sved's questionable calculation 
of homozygosity at the second locus, given the alleles on the first locus are IBD. 
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Ohta and Kimura (1971) derived an approximate formula for the expected LD at equi- 
librium using the theory of diffusion process approximation. Here, the ratio of expectations 
instead of the expectation of the ratio is used to calculate the expected LD, resulting in 

E(r 2 ) « 5 + 2 ^ „. (12) 

v ' 11 + 26N e c + 8{N e c) 2 y ' 

McVean (2007) demonstrated that the main difference between Sved's formula for the ex- 
pected LD at equilibrium and eqn (12) is for small values of N e c: While the expected LD 
based on Sved's formula approaches 1 for c tending to zero, eqn (12) tends to a value con- 
siderably less than 1. Comparing both estimates from Monte Carlo coalescent simulation, 
McVean (2007) found that neither of the formulae provides a particularly accurate pre- 
diction for the expected value of LD at equilibrium, unless rare variants (MAF < 0.1) are 
excluded. But eqn (12) still predicts the general shape of the decrease in LD with increasing 
N e c, and it fits the simulated data better than Sved's formula when compared to a sliding 
average of simulated LD values. 

Song and Song (2007) also used diffusion process approximation to derive a formula 
for the expected LD at equilibrium for a model with recurrent mutation, genetic drift and 
recombination. Note that the considered process in diffusion approximation is continuous 
in both time and space. Diffusion processes possess many nice properties which allow the 
calculation of certain expectations at stationarity with little effort (Song and Song, 2007) . 
Song and Song (2007) were able to express the LD at equilibrium as infinite sum over certain 
terms, which in turn can be evaluated using the diffusion approximation, finally enabling 
a numerical calculation of the expected LD. One major drawback of this approach is that 
the diffusion approximation is only valid for sufficiently large populations. For Nc — > oo 
Song and Song (2007) derived a closed-form expression for the expected LD at equilibrium 
which is the same as obtained by Ohta and Kimura (1971) for the expectation of the ratio. 

Song and Song (2007) provide an approximate formula for the expected LD at equi- 
librium directly, without making a detour via a recursion formula, and despite the fact 
that derivations based on diffusion approximations are a priori valid for sufficiently large 
populations only, their approach might even constitute a reasonable approximation even 
for small values of N. So far, we have not compared the validity of the proposed formula 
for the expected LD at equilibrium in this study with the results obtained by Song and 
Song (2007), nor have we compared our AT e -estimates with estimates based on coalescent 
approaches, as e.g. proposed by Li and Durbin (2011), who estimate N e for all past times 
via a "pairwise sequentially Markovian coalescent model". These comparisons are left for 
future research. "Direct" approaches as applied by Song and Song (2007) or Li and Durbin 
(2011) can easily incorporate mutation and recombination rates and do not rely on the 
formula of Hayes ct al. (2003) for the determination of the corresponding time in point an 
-/V e -estimates refers to, whose derivation was in fact based on the concept of "chromosome 
segment homozygosity" instead of LD. 



3.6. Conclusions 

In this study, we provide a theoretical basis for modeling the evolution of LD in a finite 
population using the framework of Markov chain theory with underlying multinomial dis- 
tribution. On the basis of simulation studies, the HapMap application and the analyses of 
the state of equilibrium, we can summarize the following important points: 
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The proposed recursion formula seems to provide a better overall fit than Sved's recursion 
formula. If N is large or if c is small, differences become marginal. 

The performance of such recursion formulae heavily depends on allele frequencies, and 
LD is in general a function of the allele frequencies and the gamete frequencies. Hence, 
estimates of average LD in the population considerably depend on the distribution of MAFs 
of the SNP pairs used for estimation. Therefore, if the formula for the expected LD at 
equilibrium is used to estimate N e , this estimate will also depend on the distribution of 
MAFs of the SNPs used to calculate the average value of LD for a given genetic distance 
c. This effect was illustrated in the HapMap application. It is important to keep in mind 
that SNP arrays used in certain populations not necessarily will reflect the allele frequency 
spectrum of this population, which can bias resulting estimates of N e . 

The currently used formulae for the expected LD at equilibrium resulting from recursive 
approaches are based on the assumption that the underlying recursion formulae are correct. 
As shown in this study, one can theoretically account for the non-exactness of the recursion 
formula when deriving a formula for the expected LD at equilibrium, but exact solutions 
in this framework can so far only be obtained for small values of N due to computational 
limitations. For small values of N, the expected bias at equilibrium is non-negligible, and 
we have indicated how the effect can be approximated for larger values of N. Since the 
effect of the non-exactness might have a substantial influence on the resulting formula, as 
we have demonstrated in our empirical analyses, this might also be relevant for practical 
applications. In any case, the mathematical complexity of the problem studied warrants 
some caution when using the results. Estimates of N e based on this method should always 
be confirmed by some other, independent method (like proposed by Song and Song (2007), 
for instance), and possible sources of bias should critically be monitored. 
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4. Appendix 

4. 1 . The expected LD at equilibrium based on a recursion formula 

Given a recursion formula like eqn (6), we will derive a formula for the expected LD at 
equilibrium which is based on the theory of Markov chains (for introductory books on 
Markov chains we refer to Grimmett and Stirzaker (2001) or Norris (1997), for instance). 
Note that the derivation pertains to all recursion formulae with arbitrary coefficients a 
and b with \a\ < 1 and that we will provide a general mathematical description of the 
term "equilibrium" which will be defined as the steady-state or "equilibrium state" of the 
considered Markov chain. 

According to the multinomial model for the development of the population of gametes, 
the sequence of gamete frequencies Xt, T = to, to+1, . . . forms a homogeneous Markov chain 
with transition matrix P which is given by the multinomial distribution of the number of 
gametes of the four types in each generation. The parameters of the multinomial distribution 
are 2N and p — (x' T n , x' T i 2 ,x' T21 , x' T22 ). Since the population size is finite, the Markov 
chain has a finite set S of states si,...,s z . Here, the Sj are quadruples of frequencies 
(a: u, £12, &21; #22)) each of which describes a possible partition of the 2N gametes into the 
four types of gametes. In total, there are ( 2 2 jy 3 ) possible states (Karlin and McGregor, 
1968). Let itt, T > to, denote the probability vector of xj-. Then 

for n = 1,2, We write r\ := r 2 (x T ), E s . [r\) :— E(ry|x fo = Sj) for all T > t , and e^ 

for the j-th unit vector (e.,- = (0, . . . , 0, 1, 0, . . . , 0) where the 1 is at the j-th position). 



4.2. Step I: Assuming that the recursion formula is exact 

Let us first assume that the recursion formula (6) with coefficients a and b holds for some 
statistic r 2 depending on the time T and on the state x to in T = to. Note that the following 
derivation is valid for arbitrary values of a and b with \a\ < 1. From the recursion formula 
we get 



3 



ft E ^( r t 2 o + l) = a J2 P 3 r2 ( S ^ + b 



for all probability vectors /x = (p\, . . . ,p z ) with pj > and J^jPj = 1- With a slight abuse 
of notation, we also write E M (r|,) for the expectation of r 2 -,, given that the initial probability 
vector 7r to equals fi. Then, the last equation is equivalent to 
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The left-hand side equals 

j j i 

Hence, we have 

E li (rl +1 )=aE fl (rl) + b 
for an arbitrary initial probability vector fj,, and the weak Markov property yields 

E M (r^ +1 ) = aE M (r|) + 6 
for all T > to- If 7T( = /u, then ixt = fiP T ~ to , and the last equation is equivalent to 
^2(n T+l ) j r 2 (s j ) = a^(7r T ) J r 2 (s i ) +6. 

If the Markov chain is regular, the convergence theorem for regular discrete Markov chains 
yields ttt — > 7T* for T — > oo with n* being the unique stationary distribution, i.e. both 
sides converge and we get 

E fi (r 2 oc ) = aE lx (r 2 0O ) + b «*. R := E^(r^) = ^— (13) 

for |a| < 1. Note that we need regularity of the Markov chain when applying the convergence 
theorem and that a chain is called "regular" if some power of P contains only positive 
elements. In our setting with P based on the multinomial distribution, this is a priori 
not true since "absorbing states" in the Markov chain exist. These absorbing states reflect 
situations in which one or more alleles are fixated. We will deal with this problem later in 
this appendix. 

4.3. Step II: Dealing with non-exactness of the recursion formula 

Since we know that eqn (6) with a and b only depending on N and c is not correct, we will 
now analyze how the non-exactness of eqn (6) affects the formula R = j^ (cf. eqn (13)). 
For each state x tf) , let 

e(x t0 ) := E Xtn (r t 2 o+1 ) - ar 2 (x tQ ) - b (14) 

be the residual term of the proposed recursion formula, and let further e = (ex, ... , e z ) T — 
(e(si), . . . ,e(s z )) T be the vector of residual terms for the different states s, of the Markov 
chain. As before, we can calculate 

3 V i / 3 

leading to 

JMr 2 n+1 )=aE M (r 2 n ) + fc + M £ 



3)1 
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for an arbitrary initial probability vector ix. The weak Markov property yields 

M r r+i) = «M4) + b + /xP T - to e 

for all T > to- If ^t — Mi then 7Tt = /xP T_ *°, and the last equation is equivalent to 

^Z^T+ijj^isj) = a^( 7 r T ) j r 2 (s 3 ) + b + n T e. 



Using ttt — > 7r* for T — >• oo as before, this finally leads to 



6 + 7r*e 
1-a 



for |a| < 1. Hence the formula for the expected LD at equilibrium differs from eqn (13) by 



the summand ?-^-. 

1 — a 



4.4. Dealing with absorbing states 

In the setting so far, the Markov chain contains several "absorbing" states (states which force 
the chain to move in a certain subset of the set of states). These absorbing states correspond 
to situations in which one or two alleles at the considered two loci are already fixed. Hence, 
the Markov chain is not regular and the convergence theorem for Markov chains cannot be 
applied. Furthermore, r 2 is not defined in case one or more allele frequencies are equal to 
zero. In practice, pairs of SNPs with fixed alleles are not considered when estimating the 
expected LD in the population. Therefore, we propose to modify the transition matrix P 
of the chain by enforcing at least one immediate mutation of an allele in case this allele 
has become fixed. The corresponding rows of P are modified for the absorbing states by 
choosing the transition probabilities in these rows as indicated in Table Al mimicking the 
enforced mutations to leave the absorbing state. 

Note that, since r 2 could also not be calculated in the simulation study when one or 
more allele frequencies were equal to zero, this modification of the Markov chain does not 
influence the recursion formula and the results of the simulation study with respect to the 
goodness of fit. 

Further note that this modification is biologically inspired by the event of mutations 
and that this modification is only one possibility among others to deal with the problem of 
absorbing states. One could e.g. also discard columns and rows of absorbing states in the 
P-matrix and rescale the rows so that their sums are equal to 1. Yet, it is a priori not clear 
which effect different procedures have on the resulting stationary distribution tt* and what 
they mean in terms of a stochastic model underlying the chain. Further research is needed 
in this area. 

In the following, we will concentrate on the first possibility described above mimicking 
biological mutations. If c > 0, all transition probabilities py are strictly larger than zero for 
some power of the modified transition matrix, and the Markov chain is regular allowing for 
the calculation of expected LD at equilibrium as described in the previous sections. From 
now on, we will restrict to the modified transition matrix. 
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Table A1. Absorbing states and their modified transition probabilities. 
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with arbitrary constant e G (0, 1) 



4.5. T/?e term ^ as parameter of interest 

Let i? be as before and let i? £ denote the expected LD at equilibrium taking into account 
the error term e. Then, the relative difference between these two values is given by 



R E 
~R 



- 1 = 



b 
1-a 



- 1 = 



it e 



Hence, 



measures the relative influence of 7r*e on the expected LD. Note that 



(F(si), . . . , F(s z )) and that tt* depends on N and c. If we were able to obtain the stationary 



distribution tt* for a fixed combination of N and c, we could quantify 



The identity 



| = (F(s 1 ), . . . ,F(s z )) motivates the choice of F as a measure of goodness of fit of the 
recursion formula since we are especially interested in the expected LD at equilibrium. 



The following statistics give a first glance at the behavior of 



TT g . 
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max, \£i 



b ~* b 

Si is closely related to Figure 2 and corresponds to the negative mean of values illustrated 
in each boxplot. S2 gives an upper bound for 



4.6. Empirical analysis based on the new recursion formula 

To analyze the term 7L r^ for the new recursion formula, we repeated the simulation study 
described in the Methods section for N = 4,8, 16 and c = 0.001, 0.01, 0.05, 0.1, 0.2, 0.3 using 
the following grid for x to = (x to ,u, £t ,i2, Xt ,2i, ^0,22): 



aii ,n G < 0, 
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1 2 

27V'27V 



x to ,i2 e <^ 0, — -, — -, . . . , (1 - x t0) ii) ^ , for given x to ,u 



1 2 

2iV' 27V" 



a;* ,2i e <^ 0, — , — - , . . . , (1 - x to ,u - 2*0,12) f . for S iven »t ,iii and x to,i2- 



As mentioned before, this grid comprises the exact and full set of states of the Markov 
chain. We chose -/V samp i e so that it had approximately the same magnitude as in the previous 
simulations. 

To obtain the stationary distribution tt* , we built the transition matrix P of the Markov 
chain according to the multinomial distribution. The absorbing states listed in Table Al 
were modified as described above. Then, we calculated P" for n = 2 1 , . . . , 2 15 . At equilib- 
rium, each column of P n is constant. By graphical inspection, it could be observed that 
this situation was always reached within n = 2 15 generations so that all rows of the power 
P" were equal to the stationary distribution tv* . In practice, E Xt (rf +1 ) is estimated us- 
ing SNP pairs with non-fixed alleles in the population. Hence, we are interested in ~^- 
where 7r* and e only contain non-absorbing states. Therefore, we calculated £.;(x to ) for all 

non-absorbing states x to based on E Xt (r 2 +1 ) obtained from the simulation and rescaled 
7r* so that its sum equaled 1 after excluding all absorbing states. Then, IE j- § could be 
calculated for different (N, c)-combinations, to analyze the influence of the non-exactness 
of the recursion formula. 



